# Create an image dataset

There are two methods for creating and sharing an image dataset. This guide will show you how to:

* Create an image dataset with `ImageFolder` and some metadata. This is a no-code solution for quickly creating an image dataset with several thousand images.
* Create an image dataset by writing a loading script. This method is a bit more involved, but you have greater flexibility over how a dataset is defined, downloaded, and generated which can be useful for more complex or large scale image datasets.

<Tip>

You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.

</Tip>

## ImageFolder

The `ImageFolder` is a dataset builder designed to quickly load an image dataset with several thousand images without requiring you to write any code.

<Tip>

💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `ImageFolder` creates dataset splits based on your dataset repository structure.

</Tip>

`ImageFolder` automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:

```
folder/train/dog/golden_retriever.png
folder/train/dog/german_shepherd.png
folder/train/dog/chihuahua.png

folder/train/cat/maine_coon.png
folder/train/cat/bengal.png
folder/train/cat/birman.png
```

Then users can load your dataset by specifying `imagefolder` in [`load_dataset`] and the directory in `data_dir`:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
```

You can also use `imagefolder` to load datasets involving multiple splits. To do so, your dataset directory should have the following structure:

```
folder/train/dog/golden_retriever.png
folder/train/cat/maine_coon.png
folder/test/dog/german_shepherd.png
folder/test/cat/bengal.png
```

<Tip warning={true}>

If all image files are contained in a single directory or if they are not on the same level of directory structure, `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.

</Tip>


If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your folder. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection. You can also use a JSONL file `metadata.jsonl`.

```
folder/train/metadata.csv
folder/train/0001.png
folder/train/0002.png
folder/train/0003.png
```

You can also zip your images:

```
folder/metadata.csv
folder/train.zip
folder/test.zip
folder/valid.zip
```

Your `metadata.csv` file must have a `file_name` column which links image files with their metadata:

```csv
file_name,additional_feature
0001.png,This is a first value of a text feature you added to your images
0002.png,This is a second value of a text feature you added to your images
0003.png,This is a third value of a text feature you added to your images
```

or using `metadata.jsonl`:

```jsonl
{"file_name": "0001.png", "additional_feature": "This is a first value of a text feature you added to your images"}
{"file_name": "0002.png", "additional_feature": "This is a second value of a text feature you added to your images"}
{"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}
```

<Tip>

If metadata files are present, the inferred labels based on the directory name are dropped by default. To include those labels, set `drop_labels=False` in `load_dataset`.

</Tip>

### Image captioning

Image captioning datasets have text describing an image. An example `metadata.csv` may look like:

```csv
file_name,text
0001.png,This is a golden retriever playing with a ball
0002.png,A german shepherd
0003.png,One chihuahua
```

Load the dataset with `ImageFolder`, and it will create a `text` column for the image captions:

```py
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["text"]
"This is a golden retriever playing with a ball"
```

### Object detection

Object detection datasets have bounding boxes and categories identifying objects in an image. An example `metadata.jsonl` may look like:

```jsonl
{"file_name": "0001.png", "objects": {"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}}
{"file_name": "0002.png", "objects": {"bbox": [[810.0, 100.0, 57.0, 28.0]], "categories": [1]}}
{"file_name": "0003.png", "objects": {"bbox": [[160.0, 31.0, 248.0, 616.0], [741.0, 68.0, 202.0, 401.0]], "categories": [2, 2]}}
```

Load the dataset with `ImageFolder`, and it will create a `objects` column with the bounding boxes and the categories:

```py
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["objects"]
{"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}
```

### Upload dataset to the Hub

Once you've created a dataset, you can share it to the Hub with the [`~datasets.DatasetDict.push_to_hub`] method. Make sure you have the [huggingface_hub](https://huggingface.co/docs/huggingface_hub/index) library installed and you're logged in to your Hugging Face account (see the [Upload with Python tutorial](upload_dataset#upload-with-python) for more details).

Upload your dataset with [`~datasets.DatasetDict.push_to_hub`]:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset.push_to_hub("stevhliu/my-image-captioning-dataset")
```

## WebDataset

The [WebDataset](https://github.com/webdataset/webdataset) format is based on TAR archives and is suitable for big image datasets.
Indeed you can group your images in TAR archives (e.g. 1GB of images per TAR archive) and have thousands of TAR archives:

```
folder/train/00000.tar
folder/train/00001.tar
folder/train/00002.tar
...
```

In the archives, each example is made of files sharing the same prefix:

```
e39871fd9fd74f55.jpg
e39871fd9fd74f55.json
f18b91585c4d3f3e.jpg
f18b91585c4d3f3e.json
ede6e66b2fb59aab.jpg
ede6e66b2fb59aab.json
ed600d57fcee4f94.jpg
ed600d57fcee4f94.json
...
```

You can put your images labels/captions/bounding boxes using JSON or text files for example.

For more details on the WebDataset format and the python library, please check the [WebDataset documentation](https://webdataset.github.io/webdataset).

Load your WebDataset and it will create on column per file suffix (here "jpg" and "json"):

```python
>>> from datasets import load_dataset

>>> dataset = load_dataset("webdataset", data_dir="/path/to/folder", split="train")
>>> dataset[0]["json"]
{"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}
```

## Loading script

Write a dataset loading script to share a dataset. It defines a dataset's splits and configurations, and handles downloading and generating a dataset. The script is located in the same folder or repository as the dataset and should have the same name.

```
my_dataset/
├── README.md
├── my_dataset.py
└── data/  # optional, may contain your images or TAR archives
```

This structure allows your dataset to be loaded in one line:

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("path/to/my_dataset")
```

This guide will show you how to create a dataset loading script for image datasets, which is a bit different from <a class="underline decoration-green-400 decoration-2 font-semibold" href="./dataset_script">creating a loading script for text datasets</a>. You'll learn how to:

* Create a dataset builder class.
* Create dataset configurations.
* Add dataset metadata.
* Download and define the dataset splits.
* Generate the dataset.
* Generate the dataset metadata (optional).
* Upload the dataset to the Hub.

The best way to learn is to open up an existing image dataset loading script, like [Food-101](https://huggingface.co/datasets/food101/blob/main/food101.py), and follow along!

<Tip>

To help you get started, we created a loading script [template](https://github.com/huggingface/datasets/blob/main/templates/new_dataset_script.py) you can copy and use as a starting point!

</Tip>

### Create a dataset builder class

[`GeneratorBasedBuilder`] is the base class for datasets generated from a dictionary generator. Within this class, there are three methods to help create your dataset:

* `info` stores information about your dataset like its description, license, and features.
* `split_generators` downloads the dataset and defines its splits.
* `generate_examples` generates the images and labels for each split.

Start by creating your dataset class as a subclass of [`GeneratorBasedBuilder`] and add the three methods. Don't worry about filling in each of these methods yet, you'll develop those over the next few sections:

```py
class Food101(datasets.GeneratorBasedBuilder):
    """Food-101 Images dataset"""

    def _info(self):

    def _split_generators(self, dl_manager):

    def _generate_examples(self, images, metadata_path):
```

#### Multiple configurations

In some cases, a dataset may have more than one configuration. For example, if you check out the [Imagenette dataset](https://huggingface.co/datasets/frgfm/imagenette), you'll notice there are three subsets. 

To create different configurations, use the [`BuilderConfig`] class to create a subclass for your dataset. Provide the links to download the images and labels in `data_url` and `metadata_urls`:

```py
class Food101Config(datasets.BuilderConfig):
    """Builder Config for Food-101"""
 
    def __init__(self, data_url, metadata_urls, **kwargs):
        """BuilderConfig for Food-101.
        Args:
          data_url: `string`, url to download the zip file from.
          metadata_urls: dictionary with keys 'train' and 'validation' containing the archive metadata URLs
          **kwargs: keyword arguments forwarded to super.
        """
        super(Food101Config, self).__init__(version=datasets.Version("1.0.0"), **kwargs)
        self.data_url = data_url
        self.metadata_urls = metadata_urls
```

Now you can define your subsets at the top of [`GeneratorBasedBuilder`]. Imagine you want to create two subsets in the Food-101 dataset based on whether it is a breakfast or dinner food.

1. Define your subsets with `Food101Config` in a list in `BUILDER_CONFIGS`.
2. For each configuration, provide a name, description, and where to download the images and labels from.

```py
class Food101(datasets.GeneratorBasedBuilder):
    """Food-101 Images dataset"""
 
    BUILDER_CONFIGS = [
        Food101Config(
            name="breakfast",
            description="Food types commonly eaten during breakfast.",
            data_url="https://link-to-breakfast-foods.zip",
            metadata_urls={
                "train": "https://link-to-breakfast-foods-train.txt", 
                "validation": "https://link-to-breakfast-foods-validation.txt"
            },
        ,
        Food101Config(
            name="dinner",
            description="Food types commonly eaten during dinner.",
            data_url="https://link-to-dinner-foods.zip",
            metadata_urls={
                "train": "https://link-to-dinner-foods-train.txt", 
                "validation": "https://link-to-dinner-foods-validation.txt"
            },
        )...
    ]
```

Now if users want to load the `breakfast` configuration, they can use the configuration name:

```py
>>> from datasets import load_dataset
>>> ds = load_dataset("food101", "breakfast", split="train")
```

### Add dataset metadata

Adding information about your dataset is useful for users to learn more about it. This information is stored in the [`DatasetInfo`] class which is returned by the `info` method. Users can access this information by:

```py
>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder("food101")
>>> ds_builder.info
```

There is a lot of information you can specify about your dataset, but some important ones to include are:

1. `description` provides a concise description of the dataset.
2. `features` specify the dataset column types. Since you're creating an image loading script, you'll need to include the [`Image`] feature.
3. `supervised_keys` specify the input feature and label.
4. `homepage` provides a link to the dataset homepage.
5. `citation` is a BibTeX citation of the dataset.
6. `license` states the dataset's license.

<Tip>

You'll notice a lot of the dataset information is defined earlier in the loading script which makes it easier to read. There are also other [`~Datasets.Features`] you can input, so be sure to check out the full list for more details.

</Tip>

```py
def _info(self):
    return datasets.DatasetInfo(
        description=_DESCRIPTION,
        features=datasets.Features(
            {
                "image": datasets.Image(),
                "label": datasets.ClassLabel(names=_NAMES),
            }
        ),
        supervised_keys=("image", "label"),
        homepage=_HOMEPAGE,
        citation=_CITATION,
        license=_LICENSE,
        task_templates=[ImageClassification(image_column="image", label_column="label")],
    )
```

### Download and define the dataset splits

Now that you've added some information about your dataset, the next step is to download the dataset and generate the splits.

1. Use the [`DownloadManager.download`] method to download the dataset and any other metadata you'd like to associate with it. This method accepts:

    * a name to a file inside a Hub dataset repository (in other words, the `data/` folder)
    * a URL to a file hosted somewhere else
    * a list or dictionary of file names or URLs
    
    In the Food-101 loading script, you'll notice again the URLs are defined earlier in the script.

2. After you've downloaded the dataset, use the [`SplitGenerator`] to organize the images and labels in each split. Name each split with a standard name like: `Split.TRAIN`, `Split.TEST`, and `SPLIT.Validation`. 

    In the `gen_kwargs` parameter, specify the file paths to the `images` to iterate over and load. If necessary, you can use [`DownloadManager.iter_archive`] to iterate over images in TAR archives. You can also specify the associated labels in the `metadata_path`. The `images` and `metadata_path` are actually passed onto the next step where you'll actually generate the dataset.

<Tip warning={true}>

To stream a TAR archive file, you need to use [`DownloadManager.iter_archive`]! The [`DownloadManager.download_and_extract`] function does not support TAR archives in streaming mode.

</Tip>

```py
def _split_generators(self, dl_manager):
    archive_path = dl_manager.download(_BASE_URL)
    split_metadata_paths = dl_manager.download(_METADATA_URLS)
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={
                "images": dl_manager.iter_archive(archive_path),
                "metadata_path": split_metadata_paths["train"],
            },
        ),
        datasets.SplitGenerator(
            name=datasets.Split.VALIDATION,
            gen_kwargs={
                "images": dl_manager.iter_archive(archive_path),
                "metadata_path": split_metadata_paths["test"],
            },
        ),
    ]
```

### Generate the dataset

The last method in the [`GeneratorBasedBuilder`] class actually generates the images and labels in the dataset. It yields a dataset according to the stucture specified in `features` from the `info` method. As you can see, `generate_examples` accepts the `images` and `metadata_path` from the previous method as arguments.

<Tip warning={true}>

To stream a TAR archive file, the `metadata_path` needs to be opened and read first. TAR files are accessed and yielded sequentially. This means you need to have the metadata information in hand first so you can yield it with its corresponding image.

</Tip>

Now you can write a function for opening and loading examples from the dataset:

```py
def _generate_examples(self, images, metadata_path):
    """Generate images and labels for splits."""
    with open(metadata_path, encoding="utf-8") as f:
        files_to_keep = set(f.read().split("\n"))
    for file_path, file_obj in images:
        if file_path.startswith(_IMAGES_DIR):
            if file_path[len(_IMAGES_DIR) : -len(".jpg")] in files_to_keep:
                label = file_path.split("/")[2]
                yield file_path, {
                    "image": {"path": file_path, "bytes": file_obj.read()},
                    "label": label,
                }
```

### Generate the dataset metadata (optional)

The dataset metadata can be generated and stored in the dataset card (`README.md` file).

Run the following command to generate your dataset metadata in `README.md` and make sure your new loading script works correctly:

```bash
datasets-cli test path/to/<your-dataset-loading-script> --save_info --all_configs
```

If your loading script passed the test, you should now have the `dataset_info` YAML fields in the header of the `README.md` file in your dataset folder.

### Upload the dataset to the Hub

Once your script is ready, [create a dataset card](./dataset_card) and [upload it to the Hub](./share).

Congratulations, you can now load your dataset from the Hub! 🥳

```py
>>> from datasets import load_dataset
>>> load_dataset("<username>/my_dataset")
```
